DOCUMENT RESUME

ED 279 682                                             TM 870 069

AUTHOR       Herman, Joan L.
TITLE        What Do the Test Scores Really Mean? Critical Issues
             in Test Design.
PUB DATE     86
NOTE         13p.; One of 46 papers commissioned by the Study
             Group on the National Assessment of Student
             Achievement and cited in Appendix B to their final
             report "The Nation's Report Card" (TM 870 049). For
             other papers in this group, see TM 870 050-094.
PUB TYPE     Viewpoints (120)
EDRS PRICE   MF01/PC01 Plus Postage.
DESCRIPTORS  *Achievement Tests; *Content Validity; *Criterion
             Referenced Tests; Educational Assessment; Educational
             Testing; Elementary Secondary Education; Measurement
             Objectives; Multiple Choice Tests; *National Surveys;
             Test Construction; *Testing Problems; Testing
             Programs; *Test Interpretation; Test Results; Test
             Validity
IDENTIFIERS  *National Assessment of Educational Progress

ABSTRACT



Issues in designing valid tests for the National 
Assessment of Educational Progress (NAEP) are discussed. Test scores 
are often provided without any information on the nature of the tasks 
represented. Because test domains are defined by individual item 
writers, the generalizability between tests and items is suspect. 
While typical content validation procedures help assure that the 
included items are important, they still might not represent the full 
range of knowledge and skills constituting given domains. As a 
result, the underlying meaning of what is tested is vague, and the 
specific definition of what is to be tested escapes public scrutiny. 
This is especially important when matching particular tests and 
curricula among states. Better specification of test content and task 
structure is recommended. Elements in good task structure should 
include: task description; content limits; linguistic features; 
cognitive complexity; and format. Recent NAEP assessments defined 
four different types of context for test items: (1) scientific, (2) 
personal, (3) societal, and (4) technological. Three levels of 
cognitive complexity were defined: (1) knows, (2) uses, and (3)
integrates. Six categories of subject content were specified. In 
conclusion, NAEP planners should emphasize content validity; define 
more specifically what is to be tested; provide better models for 
item construction; and assure that the entire domain is represented. 
(GDC) 



***********************************************************************
*    Reproductions supplied by EDRS are the best that can be made     *
*                     from the original document.                     *
***********************************************************************






What Do the Test Scores Really Mean? 



Critical Issues in Test Design 



Joan L. Herman 



University of California-Los Angeles 



Paper commissioned by 



THE STUDY GROUP ON THE NATIONAL ASSESSMENT OF STUDENT ACHIEVEMENT 



U.S. DEPARTMENT OF EDUCATION
Office of Educational Research and Improvement
EDUCATIONAL RESOURCES INFORMATION CENTER (ERIC)

This document has been reproduced as received from the person or
organization originating it.
Minor changes have been made to improve reproduction quality.

Points of view or opinions stated in this document do not
necessarily represent official OERI position or policy.



1986 



"PERMISSION TO REPRODUCE THIS 
MATERIAL HAS BEEN GRANTED BY 



TO THE EDUCATIONAL RESOURCES 
INFORMATION CENTER (ERIC)." 



DRAFT 



WHAT DO THE TEST SCORES REALLY MEAN? 
CRITICAL ISSUES IN TEST DESIGN 



Joan L. Herman 
UCLA Center for the Study of Evaluation 



The National Assessment of Educational Progress aims to
provide information to the public, legislators, educators and
others about students' level of performance on a broad spectrum
of significant, age appropriate knowledge and skills within a
particular subject area. How well are students performing? What
are they able to do? Has the level of their performance changed
for the better or the worse? The problem is essentially a
descriptive one, with the caveat that descriptions over time
probably are equally of interest. Unfortunately, however, the
descriptive questions of what students are able to do and of at
what level they are able to perform cannot be sensibly answered
without knowing the nature of what they are being asked to do and
without rigorous assurance that the items used to describe their
performance adequately represent the knowledge or skill of
interest. This observation, while axiomatic and longstanding, has
not received adequate attention from those in the testing
and measurement community. Researchers and test developers are
quick to warn that test scores represent only estimates of a
student's skills, and they have developed sophisticated models and
elegant techniques to make those estimates more empirically
precise and/or efficient. Their techniques, however, assume a
well defined test domain, an assumption which frequently is
violated, raising very basic questions about what is being
estimated and what a test score means with regard to the quality
and level of student performance. This paper argues that NAEP's
search for empirical precision needs to be matched by equal
concern for conceptual precision in the specification of test
content. It begins with a discussion of problems which arise
when test content is not well specified, considers issues in
assuring that NAEP taps the most significant subject area skills,
and recommends test specification solutions that are based in
current research in cognition and the structure of expertise and
that expand response alternatives beyond selected response
options.

The Problem 

The essential problem is this: What sense can be made out of a
test score when we do not know the nature of the task it
represents? Or stated alternatively, how can one validly and
reliably measure some knowledge or skill without knowing the
nature of the domain which the measurements are supposed to
represent? The answer, according to common test development
practice, is that we do not need to have a very detailed view of
what we want to assess in order to assess it well and make
sensible interpretations of the results. Rather than starting
a priori with a detailed, well grounded conception, we arrive at
one post hoc. Consider the typical process: teachers, content
experts and/or others are assembled to generate large numbers of
test items in response to a very general content-process matrix;
the items so generated are subjected to both empirical and
judgmental procedures; the surviving items, those which are
judged representative of something important and which are
empirically coherent, are then assumed to adequately define the
domain of interest. We leave it to the item writers and to the
test items themselves, in short, to define the domain de facto.

The problem with such a process is that its base is essentially
arbitrary. It aggregates the content biases and idiosyncrasies of
individual item writers and assumes that somehow by combining a
great number and variety of individual [illegible] we end up
with a sound, representative domain and a set of generalizable
measures of that domain. The error of this assumption is evident
in a number of studies that have conducted comparative analyses
of standardized test content (Herman and Cabello, 1984; Floden et
al., [year illegible]; Schmidt, 1983). These studies have found
that although there is broad agreement in the various tests on the
subscales that are used to constitute the assessment in each basic
skill area, there is considerable disparity in the specific skills
which are used to represent each subscale. Thus, for example,
most standardized tests purport to measure and provide reports on
students' performance in something akin to vocabulary and reading
comprehension within reading; math concepts, computation, and
problem solving within math; capitalization, punctuation, usage,
and spelling within language arts; but the types of items
included within each scale, the relative emphases given specific
skills within each area, and the specific topic coverage differ
from one test publisher to the next and [illegible].
Generalizability of results from one test to another, or from
one set of items to another, is suspect, and the meaning of the
domain inconsistent.

Gross imperfections in generalizability and the problems they
pose in validly interpreting test results are highlighted when
the number of items included in an assessment is small. In the
extreme case, consider the NAEP writing report, Trends Across the
Decade, 1974-84 (Applebee, Langer, Mullis, 1986). The report
characterizes trends in student writing performance in three
different genres: informative writing, persuasive writing, and
imaginative writing. Because of changes in [illegible] assessment
and in administrative procedures over the three assessment
periods, responses to only one prompt per genre were available to
characterize student performance at each [illegible]. What
we can derive from such findings is directly proportionate to our
confidence that each prompt adequately represents students'
performance in each genre. To what extent is each prompt
representative in this sense? No rationale is available, either
from an empirical standpoint, e.g., evidence to suggest that
students' performance on the administered prompt fell near the
midpoint of their performance on a range of tasks within the
genre, or from a design standpoint, e.g., rational support,
preferably research based, for the proposition that the
administered prompt is modally prototypical and/or, based on
a priori task requirements, represents some mid level of task
difficulty. On the contrary, we have some evidence to suggest
that students' performance within genre is not stable, and that
depending on which prompt we choose to examine, we come up with
significantly different pictures of student skill levels.
Looking at the 1974 and 1979 assessments of nine-year-olds, for
example, we find that the percentage of students rated minimal or
better on task accomplishment in persuasive writing in one year
ranges from about 35 percent to about 75 percent depending on the
prompt chosen to characterize their performance (p. 45). (See
Figure 1.) The trend data also lead to different conclusions
depending on the prompts selected for scrutiny. Looking at the
performance of thirteen-year-olds on imaginative tasks, we find
that two of the three prompts show a slight upward trend from the
1974 to the 1979 assessments while the third, and the one on
which the three year trend analysis is based, shifts downward
over the same period (p. 46). The choice of items, in short,
profoundly affects performance level interpretations, and the
item writer(s), not the domain itself, in many ways control the
results and their conclusions.

INSERT FIGURE 1 ABOUT HERE 

While one might counterargue that averaging students'
performance over a number of items, as is typically the case in
multiple choice tests, alleviates some of these generalizability
problems, and certainly this would strengthen the
interpretability of the writing example just cited, problems in
content validity still remain. Consider, for example, the number
of subject area topics which are supposed to be assessed by
NAEP's science assessment. Any single multiple choice item
typically measures only a minuscule fraction of the specific
topic, and what it specifically covers is left to the discretion
of the item writer, essentially hidden from public view. Although
typically employed content validation procedures help to assure
that items included on the test are considered important, what
assurance is there that the test items represent the full range
of important relevant content? Have items been sampled broadly
to be fully representative of the domain of interest, or are test
items concentrated in particular areas and in a constricted, but
unknown, skill range? Figure 2 displays alternate pictures of
how well a given number of test items covers important content
within a particular topic area. Which is an accurate picture of
current NAEP assessments?

INSERT FIGURE 2 ABOUT HERE

We hope, of course, for the most balanced, comprehensive
picture. Returning to the example of the writing study, for
instance, it would be highly desirable to sample student
performance from the full range of tasks which are typical of a
particular aim in order to characterize fairly how students
perform in that genre. Furthermore, in order to sample them
adequately, we may well want to consider defining a priori the
relevant boundaries and the varying task requirements which
constitute different types of tasks within the genre. We could
then be more assured that the specific exemplars selected for
testing were optimally representative of the domain of interest.
The problem is most obvious in production tasks where the number
of items sampled is small, but also exists in multiple choice
assessments featuring large numbers of items.

The tests of empirical coherence typically employed in the
development of multiple choice assessments, in fact, could
work against balanced representation and may instead
reinforce a more constricted view of a given skill or knowledge
domain. The difficulty of using multiple choice items to measure
deep understanding and the highest levels of cognitive skill is
frequently acknowledged. Studies have also demonstrated the
limits of using multiple choice items to measure higher level
production skills. Studies by UCLA's Center for the Study
of Evaluation (Spooner Smith, 1980), for example, found that
students' performance on multiple choice tests of writing skill
did not adequately predict their actual performance in writing,
even when both measures were directed at the same analytic skill
categories. (Both the multiple choice test items and the scoring
scheme for analyzing their writing were directed at the same
elements within the domain: use of topic sentence, support,
organization, usage, etc.)

Taken together, these two observations (the difficulty of
developing test items to measure the highest levels of cognitive
skill and the inadequacies of recognition items for measuring
complex production tasks) point to an important
flaw in relying upon empirical coherence to validate a set of
items. Within any given field test of multiple choice items,
then, we might expect only a few items creatively written to
assess high-order production skills and problem-solving;
conversely, we might expect most of the items, because they are
easier to conceive and construct, to assess lower level skills.
Those items which are empirically coherent, then, may well be
concentrated in lower levels of skill application and miss the
most complex aspects of problem solving. On the other hand, some
of the items which are discarded as outliers may in fact be
capturing something of real significance, critical aspects of
what we're trying to measure. A constricted assessment may be the
result.
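
The screening effect described above can be illustrated with a
small sketch. It is not taken from NAEP procedures or from the
studies cited: it assumes that empirical coherence is checked with
a corrected item-total correlation (one common screen, used here
only as a stand-in), it uses an arbitrary cutoff of 0.3, and the
response data are invented for four hypothetical students who
differ on two unrelated skills.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def corrected_item_total(responses, item):
    """Correlate one item with the total score on all *other* items."""
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    return pearson(item_scores, rest_scores)


# Invented data: one student per combination of (recall, production) skill.
students = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Items 0-5 are lower-level items answered correctly only with recall skill;
# items 6-7 are higher-order items answered correctly only with production skill.
responses = [[recall] * 6 + [production] * 2 for recall, production in students]

THRESHOLD = 0.3  # assumed cutoff, chosen only for illustration

correlations = [corrected_item_total(responses, i) for i in range(8)]
retained = [i for i, r in enumerate(correlations) if r >= THRESHOLD]
discarded = [i for i, r in enumerate(correlations) if r < THRESHOLD]

print("retained items:", retained)
print("discarded items:", discarded)
```

Because the lower-level items outnumber the higher-order ones,
they dominate the total score, so in this toy data the two
higher-order items fall below the cutoff and are the ones
screened out.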

In summary, there are a number of problems in the descriptive
validity of test results under traditional test construction
procedures: the definition of the domain rests in the hands of
item writers and is left to their collective biases; the
generalizability of the domain definitions thus is suspect.
Furthermore, while typical content validation procedures help
to assure that those things included on a test are important,
whether the test items are representative of the full range of
knowledge and skill constituting given domains is moot. As a
result, the underlying meaning of what is tested is slippery and
the specific definition of what is to be tested escapes public
scrutiny. (The very general frameworks or content/process
matrices which are used to characterize test content are not the
subject of the latter statement; these are quite public, but
defined at a level of abstraction where few could disagree and
which permits a wide variety of specific test content.)

The publicness of what is tested and the clarity and precision
of its specification become increasingly important when the
match between and among particular tests and/or curricula is an
issue. For example, the equity of using NAEP in state-by-state
comparisons rests at least partially in the match between the
curricular intentions in each state and the NAEP items. Without
knowing the underlying specific bases of the items, it is
difficult to come to a meaningful determination of such a match.
The problem of relying on "similar sounding" subscales for
making such determinations is demonstrated in the test content
analyses cited above. Adding fuel to the argument is a recent
study comparing subject matter contained in state assessments
across the country which found great diversity in the depth and
breadth of coverage on presumably similar subscales (Burstein,
Baker & Aschbacher, 1985). Matches determined at this level,
then, would be both superficial and artificial.

Toward a Solution 

Inherent in the arguments above is a solution to the problem
of obtaining more meaningful and interpretable test results:
better specification of test content. This call for greater
descriptive rigor is not new but harkens back to early advocacy
for criterion referenced testing and later for competency tests.
More recently, Baker and Herman (1983) have outlined a test design
approach grounded in research in learning, instruction, and
cognitive science and focused on the definition of task
structures. Elements specified in such structures include:

Task description, or a general descriptor characterizing the 
nature of the knowledge, skill, or objective to be assessed; 

Content limits, which circumscribe a priori the substance or
content which is permissible for testing and the performance
quality or level of discrimination expected, both defined by
reference to the curriculum, consensus, and principles of
learning and understandings of the structure of knowledge;

Linguistic features, controlling the linguistic complexity 
of assessment so that it does not interfere with the 
construct actually being assessed; 

Cognitive complexity, or the intellectual "level" apart from
content at which the items are targeted, operationalized in
relation to the specified content limits;

Format, including both the descriptive modes in which the
task is presented and the form in which the task is
to be presented.

The task structure provides a specific, descriptive and 
generalizable blueprint against which test items can be generated 
and a public and operational statement for each domain being 
assessed. 
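
As a rough illustration of how such a blueprint might be recorded,
the sketch below stores the five elements listed above as a single
record and reports any element left unspecified. The class name,
field names, helper function, and the example persuasive-writing
entries are all hypothetical; the paper does not prescribe a
notation, and this is only one possible arrangement.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TaskStructure:
    """One record per domain, holding the five elements named in the text."""
    task_description: str
    content_limits: str
    linguistic_features: str
    cognitive_complexity: str
    format: str


def missing_elements(spec):
    """Return the names of elements that were left blank."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name).strip()]


# Hypothetical example entries, invented for illustration only.
example = TaskStructure(
    task_description="Write a letter persuading a principal to change a rule",
    content_limits="School rules familiar to students at the tested age",
    linguistic_features="Prompt vocabulary at or below the tested grade level",
    cognitive_complexity="",  # deliberately left blank to show the check
    format="Printed prompt; handwritten extended response",
)

print(missing_elements(example))
```

A record of this kind is public and reviewable before any items
are written, which is the property the task structure is meant to
supply.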

Recent NAEP assessments have been moving toward such a
domain specification approach, and their attempts to ground
definitions of skills in recent theory of cognition are
commendable. But further progress is desirable, progress which
could not only benefit the validity of NAEP assessments but also
provide models for local and state test development.
Take, for example, the 1985-86 NAEP Science Assessment
(NAEP, 1986). The assessment framework is a three-dimensional
matrix defined by content, context, and level of cognition. The
content dimension specifies six categories, including the
traditional disciplines of science, its nature and processes and
its history, and specific topics to be assessed within each. The
context dimension defines four different types of context for
test items: scientific, personal, societal, and technological.

Perhaps most interesting is the cognition dimension which 
attempts to define items according to the cognitive processes 
required to deal with science content at different levels of 
cognitive complexity: 

Knows: Successful performance depends on the ability to
recall specific facts, concepts, principles, and methods of
science; to show familiarity with scientific terminology; to
recognize these basic ideas in a different context; and to
translate information into other words or another format.
This category generally involves a one-step cognitive
process.

Uses: These exercises test the ability to combine factual
knowledge with rules, formulas, and algorithms for a
specified purpose. Successful performance depends on the
ability to apply basic scientific facts and principles to
concrete and/or unfamiliar situations; to interpret
information or data using the basic ideas of the natural
sciences; and to recognize relationships of concepts, facts,
and principles to phenomena observed and data collected.
This category generally involves a two-step cognitive
process.

Integrates: These exercises test the ability to organize
the component processes of problem solving and learning for
the attainment of more complex goals. Successful
performance depends on the ability to analyze a problem in a
manner consistent with the body of scientific concepts and
principles, to organize a series of logical steps, to draw
conclusions on the basis of available data, to evaluate
the best procedure under specified conditions, and to
employ other higher-order skills needed for reaching the
solution to a problem.

This category generally involves multi-step cognitive
processes. In particular, it requires such mental processes
as generalizing; hypothesizing; interpolating and
extrapolating; reasoning by analogy, induction and
deduction; and synthesizing and modeling. (p. 10)
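
The three-dimensional framework described above can be sketched
as a set of cells against which an item pool is tallied, making
any uncovered cells visible. This is only an illustration under
stated assumptions: the six content categories are not named
here, so placeholder labels are used, and the three-item pool is
invented. The context and cognition labels are those given in the
text.

```python
from collections import Counter
from itertools import product

# Placeholder labels: the framework specifies six content categories,
# but their names are not listed in this paper.
CONTENT = [f"content_{i}" for i in range(1, 7)]
CONTEXT = ["scientific", "personal", "societal", "technological"]
COGNITION = ["knows", "uses", "integrates"]

CELLS = list(product(CONTENT, CONTEXT, COGNITION))


def coverage(items):
    """Count items per (content, context, cognition) cell and list empty cells."""
    counts = Counter(items)
    unknown = [cell for cell in counts if cell not in CELLS]
    if unknown:
        raise ValueError(f"items classified outside the framework: {unknown}")
    empty = [cell for cell in CELLS if counts[cell] == 0]
    return counts, empty


# Invented item pool, classified by hand for illustration only.
item_pool = [
    ("content_1", "scientific", "knows"),
    ("content_1", "scientific", "knows"),
    ("content_2", "personal", "uses"),
]

counts, empty = coverage(item_pool)
print(f"{len(CELLS)} cells in the framework, {len(empty)} with no items")
```

A tally of this kind shows where items cluster within the matrix,
but it says nothing about coverage inside a cell, which is where
the specification problems discussed in this paper arise.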

While this is a heroic attempt to operationalize the meaning of
higher, mid- and lower level cognitive skills, because the
boundaries of each level are not clear it falls short of its goal
of producing a usable scheme for generating and
categorizing test items. The differentiation between use and
integrate, in particular, is often difficult to comprehend, e.g.,
the difference between interpreting data using the basic ideas of
the natural sciences and analyzing a problem consistent with the
body of scientific concepts and principles. Apparently absent
also is a set of instructions, exemplars and models for
generating test items for each category. Further work in
clarifying and better operationalizing the meaning of each of the
categories and in validating their integrity would be important
contributions to both the assessment and teaching of science,
providing a common vocabulary and a set of parameters that can
be applied to science content and that can help focus instruction
and test development. The idea is not to specify a common set of
science objectives for all schools, or all states, but rather to
provide usable tools and models for stimulating the definition
of state or locally sensitive goals and objectives and for
generating test items covering a range of levels of cognitive
complexity.

Examining the extent to which the definitions of cognitive 
complexity applicable to science objectives can be generalized 
to those in the social sciences is also worthy of exploration. 
Particularly at the elementary school level, a common scheme
across content areas would simplify practitioners' work in test
development and in instruction and might contribute as well to
curricular integration. At the secondary level, it might 
encourage communication among professionals across disciplines and 
contribute also to curricular integration at that level. 

Coming up with more refined specifications for writing
multiple choice items measuring higher levels of cognitive
complexity in the various content areas would be an important
step in increasing the validity and interpretability of NAEP
results, but it would not solve the problem of assuring that NAEP
assesses the full range of skill development, including the
highest levels of problem-solving and critical thinking.
Literature cited earlier points to the limits of using multiple
choice items. The ultimate goals are production skills, not
recognition, which require constructed response/essay items for
valid assessment.



[illegible] nature of the task that students are
[illegible] what are the defining
characteristics of tasks requiring critical thinking? Over what
[illegible] at what level of complexity should the tasks be
constructed? What specific content understanding or knowledge is
prerequisite to task completion? Defining the nature of a
[illegible] valid and reliable generalizable rubrics [illegible]
domain specification process: what aspects or
elements of the response need to be attended to? What criteria
should be employed and what rules can be constructed to reliably
[illegible] use in
assessing content area understanding, models which are
[illegible] Burry, 1983).

[illegible] their multiple choice counterparts, it would be of
[illegible] across areas. The roots of potential solutions may
[illegible] in the structure of
novice and skilled performance. As an example, research in
[illegible] learning and in expert systems has produced
[illegible] for representing and assessing knowledge structures
(Dansereau & Holley, 1982; Novak et al., 1983; Naveh-Benjamin,
1986). To what extent can these techniques be adapted and their
results quantified to provide reliable, generalizable strategies
for assessing deeper levels of content understanding?

This paper has argued that NAEP planners need to give more
attention to content validity issues in test design. It has
[illegible]
NAEP advances in these areas would enhance the quality of the
national assessment and also would provide important benefits for
state and local practice.

FIGURE 1
Differences in Performance Levels Depending on Prompt Selection

(b) Percentage of 13-year-olds Rated Minimal or Better on
Imaginative Tasks

[Figure not reproduced. Recoverable labels: Prompt 1, Prompt 3, and
one prompt whose number is illegible; assessment years 1974, 1979,
and 1984.]

Taken from NAEP, 1986, p. 45-46